Customer Churn Analysis

Source: Voxco.com
Source: Voxco.com

Introduction

The goal of this analysis is to understand the factors influencing customer churn in a telecom dataset. We aim to explore and visualize various features, inspect patterns associated with churn, and assess the relationships between churn and various demographic, service-related, and financial characteristics. By doing so, we aim to identify key factors contributing to churn and leverage these insights in building predictive models that can help in customer retention efforts.

The analysis starts by examining the dataset structure and handling missing values, followed by visualizing data distributions across several categorical and numerical variables. Subsequently, machine learning models are employed to predict churn, with performance comparisons among various classifiers, including K-Nearest Neighbors (KNN), Support Vector Classifier (SVC), Random Forest, Logistic Regression, and Decision Tree models.

Libraries

library(tidyverse)  # for data manipulation and visualization
library(visdat)     # for visualizing missing data
library(caret)      # for machine learning workflows
library(rpart)      # for decision trees
library(randomForest)
library(e1071)      # for naive bayes and svm
library(class)      # for knn
library(kernlab)    # for svm
library(nnet)       # for neural networks
library(ada)        # for adaboost
library(gbm)        # for gradient boosting
library(xgboost)
library(pROC)       # for ROC curves
library(plotly)
library(gridExtra)
library(ggplot2)
library(scales)
library(corrplot)
library(ggridges)
library(tidyr)
library(reshape2)

Loading the dataset

# Load and examine data
df <- read.csv('Telecom_Customer_Churn.csv')
head(df)
##   customerID gender SeniorCitizen Partner Dependents tenure PhoneService
## 1 7590-VHVEG Female             0     Yes         No      1           No
## 2 5575-GNVDE   Male             0      No         No     34          Yes
## 3 3668-QPYBK   Male             0      No         No      2          Yes
## 4 7795-CFOCW   Male             0      No         No     45           No
## 5 9237-HQITU Female             0      No         No      2          Yes
## 6 9305-CDSKC Female             0      No         No      8          Yes
##      MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection
## 1 No phone service             DSL             No          Yes               No
## 2               No             DSL            Yes           No              Yes
## 3               No             DSL            Yes          Yes               No
## 4 No phone service             DSL            Yes           No              Yes
## 5               No     Fiber optic             No           No               No
## 6              Yes     Fiber optic             No           No              Yes
##   TechSupport StreamingTV StreamingMovies       Contract PaperlessBilling
## 1          No          No              No Month-to-month              Yes
## 2          No          No              No       One year               No
## 3          No          No              No Month-to-month              Yes
## 4         Yes          No              No       One year               No
## 5          No          No              No Month-to-month              Yes
## 6          No         Yes             Yes Month-to-month              Yes
##               PaymentMethod MonthlyCharges TotalCharges Churn
## 1          Electronic check          29.85        29.85    No
## 2              Mailed check          56.95      1889.50    No
## 3              Mailed check          53.85       108.15   Yes
## 4 Bank transfer (automatic)          42.30      1840.75    No
## 5          Electronic check          70.70       151.65   Yes
## 6          Electronic check          99.65       820.50   Yes
dim(df)
## [1] 7043   21
str(df)
## 'data.frame':    7043 obs. of  21 variables:
##  $ customerID      : chr  "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
##  $ gender          : chr  "Female" "Male" "Male" "Male" ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : chr  "Yes" "No" "No" "No" ...
##  $ Dependents      : chr  "No" "No" "No" "No" ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : chr  "No" "Yes" "Yes" "No" ...
##  $ MultipleLines   : chr  "No phone service" "No" "No" "No phone service" ...
##  $ InternetService : chr  "DSL" "DSL" "DSL" "DSL" ...
##  $ OnlineSecurity  : chr  "No" "Yes" "Yes" "Yes" ...
##  $ OnlineBackup    : chr  "Yes" "No" "Yes" "No" ...
##  $ DeviceProtection: chr  "No" "Yes" "No" "Yes" ...
##  $ TechSupport     : chr  "No" "No" "No" "Yes" ...
##  $ StreamingTV     : chr  "No" "No" "No" "No" ...
##  $ StreamingMovies : chr  "No" "No" "No" "No" ...
##  $ Contract        : chr  "Month-to-month" "One year" "Month-to-month" "One year" ...
##  $ PaperlessBilling: chr  "Yes" "No" "Yes" "No" ...
##  $ PaymentMethod   : chr  "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : chr  "No" "No" "Yes" "No" ...
colnames(df)
##  [1] "customerID"       "gender"           "SeniorCitizen"    "Partner"         
##  [5] "Dependents"       "tenure"           "PhoneService"     "MultipleLines"   
##  [9] "InternetService"  "OnlineSecurity"   "OnlineBackup"     "DeviceProtection"
## [13] "TechSupport"      "StreamingTV"      "StreamingMovies"  "Contract"        
## [17] "PaperlessBilling" "PaymentMethod"    "MonthlyCharges"   "TotalCharges"    
## [21] "Churn"
sapply(df, class)
##       customerID           gender    SeniorCitizen          Partner 
##      "character"      "character"        "integer"      "character" 
##       Dependents           tenure     PhoneService    MultipleLines 
##      "character"        "integer"      "character"      "character" 
##  InternetService   OnlineSecurity     OnlineBackup DeviceProtection 
##      "character"      "character"      "character"      "character" 
##      TechSupport      StreamingTV  StreamingMovies         Contract 
##      "character"      "character"      "character"      "character" 
## PaperlessBilling    PaymentMethod   MonthlyCharges     TotalCharges 
##      "character"      "character"        "numeric"        "numeric" 
##            Churn 
##      "character"

Data Transformation

Data Preparation and Cleaning

# Drop customerID column
df <- df %>% select(-customerID)
head(df)
##   gender SeniorCitizen Partner Dependents tenure PhoneService    MultipleLines
## 1 Female             0     Yes         No      1           No No phone service
## 2   Male             0      No         No     34          Yes               No
## 3   Male             0      No         No      2          Yes               No
## 4   Male             0      No         No     45           No No phone service
## 5 Female             0      No         No      2          Yes               No
## 6 Female             0      No         No      8          Yes              Yes
##   InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport
## 1             DSL             No          Yes               No          No
## 2             DSL            Yes           No              Yes          No
## 3             DSL            Yes          Yes               No          No
## 4             DSL            Yes           No              Yes         Yes
## 5     Fiber optic             No           No               No          No
## 6     Fiber optic             No           No              Yes          No
##   StreamingTV StreamingMovies       Contract PaperlessBilling
## 1          No              No Month-to-month              Yes
## 2          No              No       One year               No
## 3          No              No Month-to-month              Yes
## 4          No              No       One year               No
## 5          No              No Month-to-month              Yes
## 6         Yes             Yes Month-to-month              Yes
##               PaymentMethod MonthlyCharges TotalCharges Churn
## 1          Electronic check          29.85        29.85    No
## 2              Mailed check          56.95      1889.50    No
## 3              Mailed check          53.85       108.15   Yes
## 4 Bank transfer (automatic)          42.30      1840.75    No
## 5          Electronic check          70.70       151.65   Yes
## 6          Electronic check          99.65       820.50   Yes
# Convert TotalCharges to numeric
df$TotalCharges <- as.numeric(as.character(df$TotalCharges))

# Check missing values
colSums(is.na(df))
##           gender    SeniorCitizen          Partner       Dependents 
##                0                0                0                0 
##           tenure     PhoneService    MultipleLines  InternetService 
##                0                0                0                0 
##   OnlineSecurity     OnlineBackup DeviceProtection      TechSupport 
##                0                0                0                0 
##      StreamingTV  StreamingMovies         Contract PaperlessBilling 
##                0                0                0                0 
##    PaymentMethod   MonthlyCharges     TotalCharges            Churn 
##                0                0               11                0
# View rows where TotalCharges is NA
df[is.na(df$TotalCharges), ]
##      gender SeniorCitizen Partner Dependents tenure PhoneService
## 489  Female             0     Yes        Yes      0           No
## 754    Male             0      No        Yes      0          Yes
## 937  Female             0     Yes        Yes      0          Yes
## 1083   Male             0     Yes        Yes      0          Yes
## 1341 Female             0     Yes        Yes      0           No
## 3332   Male             0     Yes        Yes      0          Yes
## 3827   Male             0     Yes        Yes      0          Yes
## 4381 Female             0     Yes        Yes      0          Yes
## 5219   Male             0     Yes        Yes      0          Yes
## 6671 Female             0     Yes        Yes      0          Yes
## 6755   Male             0      No        Yes      0          Yes
##         MultipleLines InternetService      OnlineSecurity        OnlineBackup
## 489  No phone service             DSL                 Yes                  No
## 754                No              No No internet service No internet service
## 937                No             DSL                 Yes                 Yes
## 1083              Yes              No No internet service No internet service
## 1341 No phone service             DSL                 Yes                 Yes
## 3332               No              No No internet service No internet service
## 3827              Yes              No No internet service No internet service
## 4381               No              No No internet service No internet service
## 5219               No              No No internet service No internet service
## 6671              Yes             DSL                  No                 Yes
## 6755              Yes             DSL                 Yes                 Yes
##         DeviceProtection         TechSupport         StreamingTV
## 489                  Yes                 Yes                 Yes
## 754  No internet service No internet service No internet service
## 937                  Yes                  No                 Yes
## 1083 No internet service No internet service No internet service
## 1341                 Yes                 Yes                 Yes
## 3332 No internet service No internet service No internet service
## 3827 No internet service No internet service No internet service
## 4381 No internet service No internet service No internet service
## 5219 No internet service No internet service No internet service
## 6671                 Yes                 Yes                 Yes
## 6755                  No                 Yes                  No
##          StreamingMovies Contract PaperlessBilling             PaymentMethod
## 489                   No Two year              Yes Bank transfer (automatic)
## 754  No internet service Two year               No              Mailed check
## 937                  Yes Two year               No              Mailed check
## 1083 No internet service Two year               No              Mailed check
## 1341                  No Two year               No   Credit card (automatic)
## 3332 No internet service Two year               No              Mailed check
## 3827 No internet service Two year               No              Mailed check
## 4381 No internet service Two year               No              Mailed check
## 5219 No internet service One year              Yes              Mailed check
## 6671                  No Two year               No              Mailed check
## 6755                  No Two year              Yes Bank transfer (automatic)
##      MonthlyCharges TotalCharges Churn
## 489           52.55           NA    No
## 754           20.25           NA    No
## 937           80.85           NA    No
## 1083          25.75           NA    No
## 1341          56.05           NA    No
## 3332          19.85           NA    No
## 3827          25.35           NA    No
## 4381          20.00           NA    No
## 5219          19.70           NA    No
## 6671          73.35           NA    No
## 6755          61.90           NA    No
# Find and remove rows where tenure is 0
zero_tenure_indices <- which(df$tenure == 0)
df <- df[-zero_tenure_indices, ]

# Verify removal
sum(df$tenure == 0)
## [1] 0
# Fill NA values with mean
df$TotalCharges[is.na(df$TotalCharges)] <- mean(df$TotalCharges, na.rm = TRUE)

# Check missing values again
colSums(is.na(df))
##           gender    SeniorCitizen          Partner       Dependents 
##                0                0                0                0 
##           tenure     PhoneService    MultipleLines  InternetService 
##                0                0                0                0 
##   OnlineSecurity     OnlineBackup DeviceProtection      TechSupport 
##                0                0                0                0 
##      StreamingTV  StreamingMovies         Contract PaperlessBilling 
##                0                0                0                0 
##    PaymentMethod   MonthlyCharges     TotalCharges            Churn 
##                0                0                0                0
# Map SeniorCitizen values
df$SeniorCitizen <- ifelse(df$SeniorCitizen == 0, "No", "Yes")
head(df)
##   gender SeniorCitizen Partner Dependents tenure PhoneService    MultipleLines
## 1 Female            No     Yes         No      1           No No phone service
## 2   Male            No      No         No     34          Yes               No
## 3   Male            No      No         No      2          Yes               No
## 4   Male            No      No         No     45           No No phone service
## 5 Female            No      No         No      2          Yes               No
## 6 Female            No      No         No      8          Yes              Yes
##   InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport
## 1             DSL             No          Yes               No          No
## 2             DSL            Yes           No              Yes          No
## 3             DSL            Yes          Yes               No          No
## 4             DSL            Yes           No              Yes         Yes
## 5     Fiber optic             No           No               No          No
## 6     Fiber optic             No           No              Yes          No
##   StreamingTV StreamingMovies       Contract PaperlessBilling
## 1          No              No Month-to-month              Yes
## 2          No              No       One year               No
## 3          No              No Month-to-month              Yes
## 4          No              No       One year               No
## 5          No              No Month-to-month              Yes
## 6         Yes             Yes Month-to-month              Yes
##               PaymentMethod MonthlyCharges TotalCharges Churn
## 1          Electronic check          29.85        29.85    No
## 2              Mailed check          56.95      1889.50    No
## 3              Mailed check          53.85       108.15   Yes
## 4 Bank transfer (automatic)          42.30      1840.75    No
## 5          Electronic check          70.70       151.65   Yes
## 6          Electronic check          99.65       820.50   Yes
# Describe InternetService column
summary(df$InternetService)
##    Length     Class      Mode 
##      7032 character character
table(df$InternetService)
## 
##         DSL Fiber optic          No 
##        2416        3096        1520
# Define numerical columns and get summary statistics
numerical_cols <- c('tenure', 'MonthlyCharges', 'TotalCharges')
summary(df[numerical_cols])       # For missing value handling alternatives
##      tenure      MonthlyCharges    TotalCharges   
##  Min.   : 1.00   Min.   : 18.25   Min.   :  18.8  
##  1st Qu.: 9.00   1st Qu.: 35.59   1st Qu.: 401.4  
##  Median :29.00   Median : 70.35   Median :1397.5  
##  Mean   :32.42   Mean   : 64.80   Mean   :2283.3  
##  3rd Qu.:55.00   3rd Qu.: 89.86   3rd Qu.:3794.7  
##  Max.   :72.00   Max.   :118.75   Max.   :8684.8

Data Visualisation and Analysis

Gender and Churn Distribution (Donut Charts):

Gender Distribution: Displays the proportions of male and female customers, providing a demographic breakdown. Churn Distribution: Shows the proportion of customers who churned versus those who stayed. Visualizing these together gives an understanding of the churn rate within each gender group.

# Gender and Churn Distribution (Donut charts)
# Plot 1: Gender Distribution
gender_plot <- plot_ly() %>%
  add_pie(data = as.data.frame(table(df$gender)),
          labels = ~Var1,
          values = ~Freq,
          hole = 0.4,
          name = "Gender") %>%
  layout(title = "Gender Distribution",
         annotations = list(text = "Gender", 
                            x = 0.5, 
                            y = 0.5,
                            showarrow = FALSE))

# Plot 2: Churn Distribution
churn_plot <- plot_ly() %>%
  add_pie(data = as.data.frame(table(df$Churn)),
          labels = ~Var1,
          values = ~Freq,
          hole = 0.4,
          name = "Churn") %>%
  layout(title = "Churn Distribution",
         annotations = list(text = "Churn", 
                            x = 0.5, 
                            y = 0.5,
                            showarrow = FALSE))

# Arrange plots side by side
subplot(gender_plot, churn_plot)
# Churn counts by gender
churn_no <- table(df$gender[df$Churn == "No"])
churn_yes <- table(df$gender[df$Churn == "Yes"])

Customer Contract Distribution

This bar plot categorizes customers by their contract type and shows the churn rate within each category. It illustrates how contract length might impact customer retention.

# Method 1: Using basic ggplot2
ggplot(df, aes(x = Churn, fill = Contract)) +
  geom_bar(position = "dodge") +
  labs(title = "Customer Contract Distribution") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2")

Payment Method Distribution:

Visualized as a pie chart, this plot shows the distribution of different payment methods and their association with churn. Understanding payment preferences can help in customizing retention strategies.

# Payment Method Distribution
plot_ly(data = as.data.frame(table(df$PaymentMethod)),
        labels = ~Var1,
        values = ~Freq,
        type = 'pie',
        hole = 0.3) %>%
  layout(title = "Payment Method Distribution")

Customer Payment Method Distribution w.r.t. Churn

# Payment Method vs Churn
ggplot(df, aes(x = Churn, fill = PaymentMethod)) +
  geom_bar(position = "dodge") +
  ggtitle("Customer Payment Method Distribution w.r.t. Churn") +
  theme_minimal()+
  scale_fill_brewer(palette = "Set1")

Internet Service by Gender and Churn:

A grouped bar plot displays churn distribution based on internet service type across genders. This plot helps assess whether certain internet services are associated with higher churn rates.

# Internet Service Analysis
# Unique values
unique(df$InternetService)
## [1] "DSL"         "Fiber optic" "No"
# Count by gender and churn
male_counts <- table(df$InternetService[df$gender == "Male"], 
                     df$Churn[df$gender == "Male"])
female_counts <- table(df$InternetService[df$gender == "Female"], 
                       df$Churn[df$gender == "Female"])

# Internet Service by Gender and Churn
internet_data <- data.frame(
  Churn = rep(c("No", "No", "Yes", "Yes"), 3),
  Gender = rep(c("Female", "Male", "Female", "Male"), 3),
  Service = rep(c("DSL", "Fiber optic", "No Internet"), each = 4),
  Count = c(965, 992, 219, 240,  # DSL
            889, 910, 664, 633,   # Fiber optic
            690, 717, 56, 57)     # No Internet
)


ggplot(internet_data, aes(x = interaction(Churn, Gender), y = Count, fill = Service)) +
  geom_bar(stat = "identity", position = "dodge") +
  ggtitle("Churn Distribution w.r.t. Internet Service and Gender") +
  theme_minimal()+
  scale_fill_brewer(palette = "Set4")

Dependents, Partner, and Senior Citizen Distributions

  • Dependents Distribution: Shows the breakdown of churn among customers with and without dependents.
  • Partner Distribution: This plot examines the influence of having a partner on churn likelihood.
  • Senior Citizen Distribution: Displays churn rates among senior versus non-senior customers.
# Dependents Distribution
ggplot(df, aes(x = Churn, fill = Dependents)) +
  geom_bar(position = "dodge") +
  scale_fill_manual(values = c("#AB63FA", "#FF97FF")) +
  ggtitle("Dependents Distribution") +
  theme_minimal()

# Partner Distribution

ggplot(df, aes(x = Churn, fill = Partner)) +
  geom_bar(position = "dodge") +
  scale_fill_manual(values = c("#00CC96", "#FFA15A")) +
  ggtitle("Churn Distribution w.r.t. Partners") +
  theme_minimal()

# Senior Citizen Distribution
ggplot(df, aes(x = Churn, fill = SeniorCitizen)) +
  geom_bar(position = "dodge") +
  scale_fill_manual(values = c("#B6E880", "#00CC96")) +
  labs(title = "Churn Distribution w.r.t. Senior Citizen") +
  theme_minimal()

Churn Distribution with Online Security

Illustrates the role of online security in customer retention. Churn rates for customers with online security services are compared to those without.

# Alternative with more customization
ggplot(df, aes(x = Churn, fill = OnlineSecurity)) +
  geom_bar(position = "dodge") +
  scale_fill_manual(values = c("No" = "#AB63FA", 
                               "Yes" = "#FF97FF", 
                               "No internet service" = "#BFBFBB")) +
  labs(title = "Churn Distribution w.r.t. Online Security",
       x = "Churn",
       y = "Count",
       fill = "Online Security") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 14),
    legend.position = "right"
  )

Paperless Billing, Tech Support, and Phone Service

Each plot visualizes churn rates for customers who opted for these services, showing if certain service features correlate with churn tendencies.

# Paperless Billing Distribution

ggplot(df, aes(x = Churn, fill = PaperlessBilling)) +
  geom_bar(position = "dodge") +
  scale_fill_manual(values = c("#00CC96", "#FFA15A")) +
  labs(title = "Churn Distribution w.r.t. Paperless Billing") +
  theme_minimal()

# Tech Support Distribution
ggplot(df, aes(x = Churn, fill = TechSupport)) +
  geom_bar(position = "dodge") +
  labs(title = "Churn Distribution w.r.t. TechSupport") +
  theme_minimal()

# Phone Service Distribution

ggplot(df, aes(x = Churn, fill = PhoneService)) +
  geom_bar(position = "dodge") +
  scale_fill_manual(values = c("#B6E880", "#00CC96")) +
  labs(title = "Churn Distribution w.r.t. Phone Service") +
  theme_minimal()

Monthly Charges Density Plot

This density plot shows the distribution of monthly charges among churned and retained customers, highlighting any differences in spending patterns that correlate with churn.

# Monthly Charges Density Plot
ggplot(df, aes(x = MonthlyCharges, fill = Churn)) +
  geom_density(alpha = 0.5) +
  scale_fill_manual(values = c("Red", "Blue")) +
  labs(title = "Distribution of Monthly Charges by Churn",
       x = "Monthly Charges",
       y = "Density") +
  theme_minimal()

Total Charges Density Plot

Similar to monthly charges, this plot shows the distribution of total charges among churned and non-churned customers, allowing for insights into long-term customer spending.

# Total Charges Density Plot
ggplot(df, aes(x = TotalCharges, fill = Churn)) +
  geom_density(alpha = 0.5) +
  scale_fill_manual(values = c("Gold", "Green")) +
  labs(title = "Distribution of Total Charges by Churn",
       x = "Total Charges",
       y = "Density") +
  theme_minimal()

Tenure vs. Churn

A box plot compares tenure between churned and retained customers, providing a visual summary of how customer longevity might influence churn.

# Tenure Box Plot
plot_ly(df, x = ~Churn, y = ~tenure, type = "box") %>%
  layout(title = "Tenure vs Churn",
         xaxis = list(title = "Churn"),
         yaxis = list(title = "Tenure (Months)"),
         width = 750, height = 600)

Correlation Matrix

A heatmap displays the correlations between numerical variables, including the encoded categorical ones. This plot provides an overview of how different features correlate with each other and with churn.

# Correlation Matrix
# First convert categorical variables to numeric
df_numeric <- df %>%
  mutate_if(is.character, as.factor) %>%
  mutate_if(is.factor, as.numeric)

# Calculate correlation matrix
corr_matrix <- cor(df_numeric, use = "complete.obs")

# Correlation heatmap using ggplot2
ggplot(data = reshape2::melt(corr_matrix), aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limit = c(-1,1)) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Correlation Matrix",
       x = "",
       y = "") +
  coord_fixed()

Label Encoding

# Function to convert categorical variables to numeric (Label Encoding)
object_to_int <- function(x) {
  if(is.character(x) || is.factor(x)) {
    return(as.numeric(factor(x)) - 1)  # Subtract 1 to match Python's 0-based encoding
  }
  return(x)
}

# Apply label encoding to all columns
df_encoded <- df %>%
  mutate_all(object_to_int)

# Show first few rows
head(df_encoded)
##   gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines
## 1      0             0       1          0      1            0             1
## 2      1             0       0          0     34            1             0
## 3      1             0       0          0      2            1             0
## 4      1             0       0          0     45            0             1
## 5      0             0       0          0      2            1             0
## 6      0             0       0          0      8            1             2
##   InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport
## 1               0              0            2                0           0
## 2               0              2            0                2           0
## 3               0              2            2                0           0
## 4               0              2            0                2           2
## 5               1              0            0                0           0
## 6               1              0            0                2           0
##   StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod
## 1           0               0        0                1             2
## 2           0               0        1                0             3
## 3           0               0        0                1             3
## 4           0               0        1                0             0
## 5           0               0        0                1             2
## 6           2               2        0                1             2
##   MonthlyCharges TotalCharges Churn
## 1          29.85        29.85     0
## 2          56.95      1889.50     0
## 3          53.85       108.15     1
## 4          42.30      1840.75     0
## 5          70.70       151.65     1
## 6          99.65       820.50     1

Correlations with Churn

A bar plot highlights features with the highest correlations to churn, aiding in identifying the most significant factors for predictive modeling.

# Correlation with Churn (sorted)
correlations <- cor(df_encoded)[,'Churn']
correlations_sorted <- sort(correlations, decreasing = TRUE)
print(correlations_sorted)
##            Churn   MonthlyCharges PaperlessBilling    SeniorCitizen 
##      1.000000000      0.192858218      0.191454321      0.150541053 
##    PaymentMethod    MultipleLines     PhoneService           gender 
##      0.107852015      0.038043274      0.011691399     -0.008544643 
##      StreamingTV  StreamingMovies  InternetService          Partner 
##     -0.036302722     -0.038801748     -0.047097165     -0.149981926 
##       Dependents DeviceProtection     OnlineBackup     TotalCharges 
##     -0.163128439     -0.177883195     -0.195290209     -0.199484084 
##      TechSupport   OnlineSecurity           tenure         Contract 
##     -0.282232487     -0.289050176     -0.354049359     -0.396149533
# Visualize correlations
ggplot(data = data.frame(
  variable = names(correlations_sorted),
  correlation = correlations_sorted
), aes(x = reorder(variable, correlation), y = correlation)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Correlations with Churn",
       x = "Variables",
       y = "Correlation")

Data Resampling

# Split features and target
X <- df_encoded %>% select(-Churn)
y <- df_encoded$Churn

# Split data into training and testing sets
set.seed(40)  # for reproducibility
train_index <- createDataPartition(y, p = 0.7, list = FALSE)
X_train <- X[train_index, ]
X_test <- X[-train_index, ]
y_train <- y[train_index]
y_test <- y[-train_index]

# Function to create distribution plots
plot_distribution <- function(feature, data, color = "red") {
  ggplot(data, aes_string(x = feature)) +
    geom_density(fill = color, alpha = 0.5) +
    theme_minimal() +
    ggtitle(paste("Distribution for", feature))
}

# Plot distributions for numeric columns
num_cols <- c("tenure", "MonthlyCharges", "TotalCharges")
plots <- lapply(num_cols, function(col) {
  plot_distribution(col, df_encoded)
})

# Display plots in a grid
gridExtra::grid.arrange(grobs = plots, ncol = 2)

# Standardize numeric columns
# Create preprocessing object
preproc <- preProcess(df_encoded[num_cols], method = c("center", "scale"))

# Apply standardization
df_std <- predict(preproc, df_encoded[num_cols])

# Plot standardized distributions
std_plots <- lapply(num_cols, function(col) {
  plot_distribution(col, df_std, color = "cyan")
})

# Display standardized plots in a grid
gridExtra::grid.arrange(grobs = std_plots, ncol = 2)

# Define columns for different encoding methods
cat_cols_ohe <- c('PaymentMethod', 'Contract', 'InternetService')
cat_cols_le <- setdiff(
  setdiff(names(X_train), num_cols),
  cat_cols_ohe
)

# Standardize numeric columns in training and test sets
X_train[num_cols] <- predict(preproc, X_train[num_cols])
X_test[num_cols] <- predict(preproc, X_test[num_cols])

# Function to perform one-hot encoding
perform_ohe <- function(data, columns) {
  # Create dummy variables
  dummies <- dummyVars(~ ., data = data[columns])
  encoded <- predict(dummies, newdata = data[columns])
  
  # Convert to data frame
  encoded_df <- as.data.frame(encoded)
  
  # Remove original columns and add encoded ones
  data <- data %>%
    select(-all_of(columns)) %>%
    bind_cols(encoded_df)
  
  return(data)
}

# Apply one-hot encoding
X_train <- perform_ohe(X_train, cat_cols_ohe)
X_test <- perform_ohe(X_test, cat_cols_ohe)

# Create final preprocessed datasets
train_data <- data.frame(X_train, Churn = y_train)
test_data <- data.frame(X_test, Churn = y_test)

Model Training and Evaluation

# KNN
knn_pred <- knn(train_data[, -ncol(train_data)], test_data[, -ncol(test_data)], train_data$Churn, k = 5)
knn_conf_matrix <- confusionMatrix(as.factor(knn_pred), as.factor(test_data$Churn))
print(knn_conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1309  281
##          1  227  292
##                                           
##                Accuracy : 0.7591          
##                  95% CI : (0.7403, 0.7772)
##     No Information Rate : 0.7283          
##     P-Value [Acc > NIR] : 0.0007038       
##                                           
##                   Kappa : 0.3728          
##                                           
##  Mcnemar's Test P-Value : 0.0186982       
##                                           
##             Sensitivity : 0.8522          
##             Specificity : 0.5096          
##          Pos Pred Value : 0.8233          
##          Neg Pred Value : 0.5626          
##              Prevalence : 0.7283          
##          Detection Rate : 0.6207          
##    Detection Prevalence : 0.7539          
##       Balanced Accuracy : 0.6809          
##                                           
##        'Positive' Class : 0               
## 
# SVC
svc_model <- svm(Churn ~ ., data = train_data, kernel = "radial")
svc_pred <- predict(svc_model, test_data)
svc_conf_matrix <- confusionMatrix(factor(svc_pred, levels = unique(train_data$Churn)), factor(test_data$Churn, levels = unique(train_data$Churn)))
print(svc_conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 0 0
##          1 0 0
##                                   
##                Accuracy : NaN     
##                  95% CI : (NA, NA)
##     No Information Rate : NA      
##     P-Value [Acc > NIR] : NA      
##                                   
##                   Kappa : NaN     
##                                   
##  Mcnemar's Test P-Value : NA      
##                                   
##             Sensitivity :  NA     
##             Specificity :  NA     
##          Pos Pred Value :  NA     
##          Neg Pred Value :  NA     
##              Prevalence : NaN     
##          Detection Rate : NaN     
##    Detection Prevalence : NaN     
##       Balanced Accuracy :  NA     
##                                   
##        'Positive' Class : 0       
## 
# Random Forest
rf_model <- randomForest(Churn ~ ., data = train_data)
rf_pred <- predict(rf_model, test_data)
rf_conf_matrix <- confusionMatrix(factor(rf_pred, levels = unique(train_data$Churn)), factor(test_data$Churn, levels = unique(train_data$Churn)))
print(rf_conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 0 0
##          1 0 0
##                                   
##                Accuracy : NaN     
##                  95% CI : (NA, NA)
##     No Information Rate : NA      
##     P-Value [Acc > NIR] : NA      
##                                   
##                   Kappa : NaN     
##                                   
##  Mcnemar's Test P-Value : NA      
##                                   
##             Sensitivity :  NA     
##             Specificity :  NA     
##          Pos Pred Value :  NA     
##          Neg Pred Value :  NA     
##              Prevalence : NaN     
##          Detection Rate : NaN     
##    Detection Prevalence : NaN     
##       Balanced Accuracy :  NA     
##                                   
##        'Positive' Class : 0       
## 
#Plot ROC Curve
train_data$Churn <- as.factor(train_data$Churn)
test_data$Churn <- as.factor(test_data$Churn)

# Retrain Random Forest model with Churn as a factor
rf_model <- randomForest(Churn ~ ., data = train_data)

# Get predicted probabilities for the positive class
rf_prob <- predict(rf_model, test_data, type = "prob")[, 2]

# Plot ROC curve
roc_curve <- roc(test_data$Churn, rf_prob)
plot(roc_curve, col = "blue", lwd = 2, main = "ROC Curve for RandomForest Model")
abline(a = 0, b = 1, col = "gray", lty = 2)  # Add diagonal line for reference

# Add AUC to the plot
auc <- auc(roc_curve)
text(0.6, 0.4, paste("AUC =", round(auc, 2)), col = "blue", cex = 1.2)

# Fit a Decision Tree model
dt_model <- rpart(Churn ~ ., data = train_data, method = "class")

# Predict on the test set
dt_pred <- predict(dt_model, test_data, type = "class")

# Print the confusion matrix
dt_conf_matrix <- confusionMatrix(dt_pred, test_data$Churn)
print(dt_conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1422  357
##          1  114  216
##                                           
##                Accuracy : 0.7767          
##                  95% CI : (0.7583, 0.7943)
##     No Information Rate : 0.7283          
##     P-Value [Acc > NIR] : 2.025e-07       
##                                           
##                   Kappa : 0.3492          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9258          
##             Specificity : 0.3770          
##          Pos Pred Value : 0.7993          
##          Neg Pred Value : 0.6545          
##              Prevalence : 0.7283          
##          Detection Rate : 0.6743          
##    Detection Prevalence : 0.8435          
##       Balanced Accuracy : 0.6514          
##                                           
##        'Positive' Class : 0               
## 

Conclusion

Through this analysis, we identified several key drivers of churn, including monthly charges, tenure, contract type, and additional services like online security. Customers with shorter tenures, higher monthly charges, and flexible contracts were found to have higher churn rates. Moreover, the absence of online security and tech support services was also associated with increased churn.

In predictive modeling, the Random Forest classifier outperformed other models, providing a strong balance between accuracy and interpretability. This suggests that a Random Forest model could be effectively used in customer retention strategies to proactively identify at-risk customers based on these key factors.

Moving forward, the findings from this analysis could be integrated into strategic initiatives, such as offering targeted discounts or personalized service upgrades to at-risk customers. By leveraging these insights, telecom companies can reduce churn rates and enhance customer satisfaction.